The majority of machine learning algorithms assume that objects are
represented as vectors. But often the objects we want to learn on are
more naturally represented by other data structures such as sequences and
time series. For these representations many standard learning algorithms
are unavailable. We generalize gradient-based learning algorithms to time
series under dynamic time warping. To this end, we introduce elastic
functions, which extend functions on matrix spaces to time series. Necessary
conditions are presented under which generalized gradient learning on time
series is consistent. We indicate how results carry over to arbitrary elastic
distance functions and to sequences consisting of symbolic elements.
Specifically, four linear classifiers are extended to time series under
dynamic time warping and applied to benchmark datasets. Results indicate
that generalized gradient learning via elastic functions has the potential to
complement the state-of-the-art in statistical pattern recognition on time
series.


1. Introduction 

Statistical pattern recognition on time series finds many applications in diverse
domains such as speech recognition, medical signal analysis, and recognition of
gestures. A challenge in learning on time series consists in filtering out the effects
of shifts and distortions in time. A common and widely applied approach to achieve
invariance to shifts and distortions is to use elastic transformations such as dynamic
time warping (DTW). Following this approach amounts to learning on time series spaces
equipped with an elastic proximity measure.

In comparison to Euclidean spaces, mathematical concepts such as the derivative of a
function and a well-defined addition under elastic transformations are unknown in time
series spaces. Therefore gradient-based algorithms cannot be directly applied to time
series. The weak mathematical structure of time series spaces has two consequences:
(a) there are only a few learning algorithms that directly operate on time series under
elastic transformation; and (b) simple methods like the nearest neighbor classifier
together with the DTW distance belong to the state-of-the-art and are reported to be
exceptionally difficult to beat.

To advance the state-of-the-art in learning on time series, first adaptive methods
have been proposed. They mainly devise or apply different measures of central
tendency of a set of time series under dynamic time warping. The individual
approaches reported in the literature are k-means, self-organizing
maps [25], and learning vector quantization [25]. These methods have been formulated
in a problem-solving manner without a unifying theme. Consequently, there is no link
to a mathematical theory that allows us to (1) place existing adaptive methods in a
proper context, (2) derive adaptive methods on time series other than those based on
a concept of mean, and (3) prove convergence of adaptive methods to solutions that
satisfy necessary conditions of optimality.

Here we propose generalized gradient methods on time series spaces that combine
the advantages of gradient information and elastic transformation such that the above
issues (1)-(3) are resolved. The key idea behind this approach is the concept of elastic
function. Elastic functions extend functions on Euclidean spaces to time series spaces
such that elastic transformations are preserved. Then learning on time series amounts
to minimizing piecewise smooth risk functionals using generalized gradient methods
proposed by [5]. Specifically, we investigate elastic versions of logistic regression,
(margin) perceptron learning, and linear support vector machine (SVM) for time series
under dynamic time warping. We derive update rules and present different convergence
results, in particular an elastic version of the perceptron convergence theorem. Though
the main treatment focuses on univariate time series under DTW, we also show under
which conditions the theory holds for multivariate time series and sequences with
non-numerical elements under arbitrary elastic transformations.

We tested the four elastic linear classifiers on all two-class problems of the UCR
time series benchmark dataset. The results show that elastic linear classifiers on
time series behave similarly to linear classifiers on vectors. Furthermore, our findings
indicate that generalized gradient learning on time series spaces has the potential
to complement the state-of-the-art in statistical pattern recognition on time series,
because the simplest elastic methods are already competitive with the best available
methods.

The paper is organized as follows: Section 2 introduces background material. Section 
3 proposes elastic functions, generalized gradient learning on sequence data, and elastic 
linear classifiers. In Section 4, we relate the proposed approach to previous approaches 
on averaging a set of time series. Section 5 presents and discusses experiments. Finally, 
Section 6 concludes with a summary of the main results and an outlook for further 
research. 

2. Background

This section introduces basic material. Section 2.1 defines the DTW distance,
Section 2.2 presents the problem of learning from examples, and Section 2.3 introduces
piecewise smooth functions.

2.1. Dynamic Time Warping Distance 

By $[n]$ we denote the set $\{1, \ldots, n\}$ for some $n \in \mathbb{N}$. A time series
of length $n$ is an ordered sequence $x = (x_1, \ldots, x_n)$ with features
$x_i \in \mathbb{R}$ sampled at discrete points of time $i \in [n]$.

To define the DTW distance between time series $x$ and $y$ of length $n$ and $m$,
respectively, we construct a grid $\mathcal{G} = [n] \times [m]$. A warping path in grid
$\mathcal{G}$ is a sequence $\phi = (t_1, \ldots, t_p)$ consisting of points
$t_k = (i_k, j_k) \in \mathcal{G}$ such that

1. $t_1 = (1, 1)$ and $t_p = (n, m)$ (boundary conditions)

2. $t_{k+1} - t_k \in \{(1,0), (0,1), (1,1)\}$ (warping conditions)

for all $1 \leq k < p$.

A warping path $\phi$ defines an alignment between sequences $x$ and $y$ by assigning
elements $x_i$ of sequence $x$ to elements $y_j$ of sequence $y$ for every point
$(i, j) \in \phi$. The boundary condition enforces that the first and last elements of
both time series are assigned to one another accordingly. The warping condition
summarizes what is known as the monotonicity and continuity conditions. The
monotonicity condition demands that the points of a warping path are in strict
ascending lexicographic order. The continuity condition defines the maximum step size
between two successive points in a path.

The cost of aligning $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$ along a
warping path $\phi$ is defined by

$$d_\phi(x, y) = \sum_{(i,j) \in \phi} c(x_i, y_j),$$

where $c(x_i, y_j)$ is the local transformation cost of aligning features $x_i$ and
$y_j$. Unless otherwise stated, we assume that the local transformation costs are given
by $c(x_i, y_j) = (x_i - y_j)^2$. Then the distance function

$$d(x, y) = \min_{\phi} \sqrt{d_\phi(x, y)}$$

is the dynamic time warping (DTW) distance between $x$ and $y$, where the minimum
is taken over all warping paths in $\mathcal{G}$.
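
The definition above can be evaluated by dynamic programming over the grid. The following is a minimal sketch, not the authors' implementation: the function name `dtw_distance` is hypothetical, the local cost is the squared difference, and the distance is taken as the square root of the minimal path cost, which is an assumption chosen to agree with the Euclidean norm used for the embeddings later on.

```python
import math


def dtw_distance(x, y):
    """DTW distance between two univariate time series (lists of floats).

    Assumes squared differences as local costs and returns the square
    root of the minimal accumulated cost over all warping paths.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j]: minimal cost of a warping path from (1, 1) to (i, j);
    # row 0 and column 0 are padding so that the recursion needs no special cases.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Allowed predecessors correspond to the steps (1,0), (0,1), (1,1).
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m])
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, because the repeated value can be absorbed by warping.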

2.2. The Problem of Learning 

We consider learning from examples as the problem of minimizing a risk functional. 
To present the main ideas, it is sufficient to focus on supervised learning. 
Consider an input space $\mathcal{X}$ and output space $\mathcal{Y}$. The problem of
supervised learning is to estimate an unknown function
$f_* : \mathcal{X} \to \mathcal{Y}$ on the basis of a training set

$$\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subseteq \mathcal{X} \times \mathcal{Y},$$

where the examples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ are drawn independent
and identically distributed according to a joint probability distribution $P(x, y)$ on
$\mathcal{X} \times \mathcal{Y}$.

To measure how well a function $f : \mathcal{X} \to \mathcal{Y}$ predicts output values
$y$ from $x$, we introduce the risk

$$R[f] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, dP(x, y),$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function that
quantifies the cost of predicting $f(x)$ when the true output value is $y$.

The goal of learning is to find a function $f : \mathcal{X} \to \mathcal{Y}$ that
minimizes the risk. The problem is that we cannot directly compute the risk of $f$,
because the probability distribution $P(x, y)$ is unknown. But we can use the training
examples to estimate the risk of $f$ by the empirical risk

$$R_N[f] = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)).$$

The empirical risk minimization principle suggests approximating the unknown
function $f_*$ by a function

$$f_N = \arg\min_{f \in \mathcal{F}} R_N[f]$$

that minimizes the empirical risk over a fixed hypothesis space
$\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$ of functions
$f : \mathcal{X} \to \mathcal{Y}$.

Under appropriate conditions on $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{F}$, the
empirical risk minimization principle is justified in the following sense: (1) a
minimizer $f_N$ of the empirical risk exists, though it may not be unique; and (2) the
risk $R[f_N]$ converges in probability to the risk $R[f_*]$ of the best but unknown
function $f_*$ when the number $N$ of training examples goes to infinity.

2.3. Piecewise Smooth Functions 

A function $f : \mathcal{X} \to \mathbb{R}$ defined on a Euclidean space $\mathcal{X}$
is piecewise smooth if $f$ is continuous and there is a finite collection of
continuously differentiable functions
$\mathcal{R}(f) = \{f_i : \mathcal{X} \to \mathbb{R} : i \in \mathcal{I}\}$ indexed by
the set $\mathcal{I}$ such that

$$f(x) \in \{f_i(x) : i \in \mathcal{I}\}$$

for all $x \in \mathcal{X}$. We call the collection $\mathcal{R}(f)$ a representation
for $f$. A function $f_i \in \mathcal{R}(f)$ satisfying $f_i(x) = f(x)$ is an active
function of $f$ at $x$. The set
$\mathcal{A}(f, x) = \{i \in \mathcal{I} : f_i(x) = f(x)\}$ is the active index set of
$f$ at $x$. By

$$\partial f(x) = \{\nabla f_i(x) : i \in \mathcal{A}(f, x)\}$$

we denote the set of active gradients $\nabla f_i(x)$ of the active functions at $x$.
Active gradients are directional derivatives of $f$. At differentiable points $x$ the
set of active gradients is of the form $\partial f(x) = \{\nabla f(x)\}$.

Piecewise smooth functions are closed under composition, scalar multiplication,
finite sums, and pointwise max- and min-operations. In particular, the max- and
min-operations of a finite collection of differentiable functions allow us to construct
piecewise smooth functions. Piecewise smooth functions $f$ are non-differentiable on a
set of Lebesgue measure zero, that is, $f$ is differentiable almost everywhere.
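
To make the notions of active index set and active gradients concrete, here is a small illustrative sketch for the special case where the piecewise smooth function is the pointwise maximum of finitely many smooth pieces. The helper name `active_gradients`, the tolerance `tol`, and the example pieces are assumptions introduced only for this illustration.

```python
def active_gradients(x, pieces, tol=1e-12):
    """Evaluate f = max_i f_i at x and collect its active gradients.

    pieces is a list of (f_i, grad_f_i) pairs of callables.
    Returns (f(x), active index set, list of active gradients).
    """
    values = [f(x) for f, _ in pieces]
    fx = max(values)
    # An index is active if its piece attains the value of f at x.
    active = [i for i, v in enumerate(values) if abs(v - fx) <= tol]
    return fx, active, [pieces[i][1](x) for i in active]


# Example: f(x) = max(x, -x) = |x| with the two linear pieces x and -x.
abs_pieces = [
    (lambda x: x, lambda x: 1.0),
    (lambda x: -x, lambda x: -1.0),
]
```

At the differentiable point `x = 2.0` only the first piece is active, whereas at the kink `x = 0.0` both pieces are active and the set of active gradients contains `1.0` and `-1.0`.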


3. Generalized Gradient Learning on Time Series Spaces 


This section generalizes gradient-based learning to time series spaces under elastic
transformations. We first present the basic idea of the proposed approach in
Section 3.1. Then Section 3.2 introduces the new concept of elastic functions. Based on
this concept, Section 3.3 describes supervised generalized gradient learning on time
series. As an example, Section 3.4 introduces elastic linear classifiers. In Section 3.5
we consider unsupervised generalized gradient learning. Section 3.6 sketches
consistency results. Finally, Section 3.7 generalizes the proposed approach to other
elastic proximity functions and arbitrary sequence data.


3.1. The Basic Idea 

This section presents the basic idea of generalized gradient learning on time series.
For this we assume that $\mathcal{F}_{\mathcal{X}}$ is a hypothesis space consisting of
functions $F : \mathcal{X} \to \mathbb{R}$ defined on some Euclidean space
$\mathcal{X}$. For example, $\mathcal{F}_{\mathcal{X}}$ consists of all linear functions
on $\mathcal{X}$. First we show how to generalize functions
$F \in \mathcal{F}_{\mathcal{X}}$ defined on Euclidean spaces to functions
$f : \mathcal{T} \to \mathbb{R}$ on time series such that elastic transformations are
preserved. The resulting functions $f$ are called elastic. Then we turn the focus to
learning an unknown elastic function over the new hypothesis space
$\mathcal{F}_{\mathcal{T}}$ of elastic functions obtained from
$\mathcal{F}_{\mathcal{X}}$.

We define elastic functions $f : \mathcal{T} \to \mathbb{R}$ on time series as a
pullback of a function $F \in \mathcal{F}_{\mathcal{X}}$ by an embedding
$\mu : \mathcal{T} \to \mathcal{X}$, that is $f(x) = F(\mu(x))$ for all time series
$x \in \mathcal{T}$.

In principle any injective map $\mu$ can be used. Here, we are interested in embeddings
that preserve elastic transformations. For this, we select a problem-dependent base
time series $z \in \mathcal{T}$. Then we define an embedding
$\mu_z : \mathcal{T} \to \mathcal{X}$ that is isometric with respect to $z$, that is

$$d(x, z) = \|\mu_z(x) - \mu_z(z)\|$$

for all $x \in \mathcal{T}$. It is important to note that an embedding $\mu_z$ is
distance preserving with respect to $z$ only. In general, we will have
$d(x, y) \leq \|\mu_z(x) - \mu_z(y)\|$, showing that an embedding $\mu_z$ will be an
expansion of the time series space. This form of a restricted isometry turns out to be
sufficient for our purposes. We call the pullback $f = F \circ \mu$ of $F$ by $\mu$
elastic if the embedding $\mu$ preserves elastic distances with respect to some base
time series. Figure 1 illustrates the concept of elastic function.

Next we show how to learn an unknown elastic function by risk minimization over
the hypothesis space $\mathcal{F}_{\mathcal{T}}$ consisting of pullbacks of functions
from $\mathcal{F}_{\mathcal{X}}$ by $\mu$. For this we assume that
$\Theta_{\mathcal{T}}$ is a set of parameters and the hypothesis space
$\mathcal{F}_{\mathcal{T}}$ consists of functions $f_\theta$ with parameter
$\theta \in \Theta_{\mathcal{T}}$. To convey the basic idea, we consider the simple
case that the parameter set is of the form $\Theta_{\mathcal{T}} = \mathcal{T}$. Then
the goal is to minimize a risk functional

$$\min_{\theta \in \mathcal{T}} R[\theta] \tag{1}$$

as a function of $\theta \in \mathcal{T}$. We cast problem (1) to the equivalent problem

$$\min_{\theta \in \mathcal{T}} R[\mu(\theta)]. \tag{2}$$

Observe that the risk functional of problem (2) is a function of elements $\mu(\theta)$
from the Euclidean space $\mathcal{X}$. Since problem (2) is analytically difficult to
handle, we consider the relaxed problem

$$\min_{\Theta \in \mathcal{X}} R[\Theta], \tag{3}$$

where the minimum is taken over the whole set $\mathcal{X}$, whereas problem (2)
minimizes over the subset $\mu(\mathcal{T}) \subset \mathcal{X}$. The relaxed problem (3)
is not only analytically more tractable but also learns a model from a larger
hypothesis space. It may therefore provide better asymptotic solutions, but may
require more training data to reach acceptable test error rates [26].

3.2. Elastic Functions 

This section formally introduces the concept of elastic function, which generalizes
functions on matrix spaces $\mathcal{X} = \mathbb{R}^{n \times m}$ to time series
spaces. The matrix space $\mathcal{X}$ is the Euclidean space of all real
$(n \times m)$-matrices with inner product

$$\langle X, Y \rangle = \sum_{i,j} x_{ij} \cdot y_{ij}$$

for all $X, Y \in \mathcal{X}$. The inner product induces the Euclidean norm

$$\|X\| = \sqrt{\langle X, X \rangle},$$

also known as the Frobenius norm.¹ The dimension $n \times m$ of $\mathcal{X}$ has the
following meaning: the number $n$ of rows refers to the maximum length of all time
series from the training set $\mathcal{D}$. The number $m$ of columns is a
problem-dependent parameter, called elasticity henceforth. A larger number $m$ of
columns admits higher elasticity and vice versa.

¹ We call $\|X\|$ Euclidean norm to emphasize that we regard $\mathcal{X}$ as a Euclidean space.

Figure 1: Illustration of elastic function $f : \mathcal{T} \to \mathbb{R}$ of a
function $F : \mathcal{X} \to \mathbb{R}$. The map $\mu = \mu_z$ embeds time series
space $\mathcal{T}$ into the Euclidean space $\mathcal{X}$. Corresponding solid red
lines indicate that distances between respective endpoints are preserved by $\mu$.
Corresponding dashed red lines show that distances between respective endpoints are
not preserved. The diagram commutes, that is $f(x) = F(\mu(x))$ is a pullback of $F$
by $\mu$.

We first define an embedding from time series into the Euclidean space $\mathcal{X}$.
We embed time series into a matrix from $\mathcal{X}$ along a warping path as
illustrated in Figure 2. Suppose that $x = (x_1, \ldots, x_k)$ is a time series of
length $k \leq n$. By $\mathcal{P}(x)$ we denote the set of all warping paths in the
grid $\mathcal{G} = [k] \times [m]$ defined by the length $k$ of $x$ and elasticity
$m$. An elastic embedding of time series $x$ into matrix $Z = (z_{ij})$ along warping
path $\phi \in \mathcal{P}(x)$ is a matrix $x \otimes_\phi Z = (x_{ij})$ with elements

$$x_{ij} = \begin{cases} x_i & : (i, j) \in \phi \\ z_{ij} & : \text{otherwise.} \end{cases}$$
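
The elastic embedding can be sketched in a few lines. The code below is only an illustration under stated assumptions: matrices are lists of rows, warping paths are lists of 1-based `(i, j)` pairs, and the names `is_warping_path` and `elastic_embedding` as well as the example path are hypothetical (the path shown in the paper's Figure 2 is not reproduced here).

```python
def is_warping_path(path, k, m):
    """Check boundary and warping conditions in the grid [k] x [m]."""
    steps = {(1, 0), (0, 1), (1, 1)}
    if not path or path[0] != (1, 1) or path[-1] != (k, m):
        return False
    return all((b[0] - a[0], b[1] - a[1]) in steps
               for a, b in zip(path, path[1:]))


def elastic_embedding(x, Z, path):
    """Return a copy of Z in which entry (i, j) is x_i for every (i, j) on the path."""
    X = [row[:] for row in Z]          # entries off the path keep the values of Z
    for i, j in path:
        X[i - 1][j - 1] = x[i - 1]     # convert 1-based path points to list indices
    return X


# Illustrative setting: n = 7 rows, elasticity m = 5, series of length k = 4.
x_example = [2, 4, 3, 1]
Z_example = [[0] * 5 for _ in range(7)]
path_example = [(1, 1), (2, 2), (2, 3), (3, 4), (4, 5)]
X_example = elastic_embedding(x_example, Z_example, path_example)
```

Rows beyond the length of the series (rows 5 to 7 in this setting) are never touched by a warping path and therefore always coincide with the corresponding rows of the matrix the series is embedded into.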

Suppose that $F : \mathcal{X} \to \mathbb{R}$ is a function defined on the Euclidean
space $\mathcal{X}$. An elastic function of $F$ based on matrix $Z$ is a function
$f : \mathcal{T} \to \mathbb{R}$ with the following property: for every time series
$x \in \mathcal{T}$ there is a warping path $\phi \in \mathcal{P}(x)$ such that

$$f(x) = F(x \otimes_\phi Z).$$

The representation set and active set of $f$ at $x$ are of the form

$$\mathcal{R}(f, x) = \{F(x \otimes_\phi Z) : \phi \in \mathcal{P}(x)\}$$

$$\mathcal{A}(f, x) = \{\phi \in \mathcal{P}(x) : f(x) = F(x \otimes_\phi Z)\}.$$


Figure 2: Embedding of time series $x = (2, 4, 3, 1)$ into matrix $Z$ along warping
path $\phi$. From left to right: time series $x$, grid $\mathcal{G}$ with highlighted
warping path $\phi$, matrix $Z$, and matrix $x \otimes_\phi Z$ obtained after embedding
$x$ into $Z$ along $\phi$. We assume that the length of the longest time series in the
training set is $n = 7$. Therefore the matrix $Z$ has $n = 7$ rows. The number $m$ of
columns of $Z$ is a problem-dependent parameter and set to $m = 5$ in this example.
Since time series $x$ has length $k = 4$, the grid $\mathcal{G} = [k] \times [m]$
containing all feasible warping paths consists of 4 rows and 5 columns. Grids
$\mathcal{G}$ vary only in the number $k$ of rows in accordance with the length
$k \leq n$ of the time series to be embedded, but always have $m$ columns.

The definition of elastic function corresponds to the properties described in Section
3.1 and in Figure 1. To see this, we define an embedding
$\mu_Z : \mathcal{T} \to \mathcal{X}$ that first selects for every time series $x$ an
active warping path $\phi \in \mathcal{A}(f, x)$ and then maps $x$ to the matrix
$\mu_Z(x) = x \otimes_\phi Z$. Then we have $F(\mu_Z(x)) = f(x)$ for all
$x \in \mathcal{T}$. Suppose that the rows of matrix $Z$ are all equal to $z$. Then
$\mu_z = \mu_Z$ is isometric with respect to $z$.

Next, we consider examples of elastic functions. The first two examples are
fundamental for extending a broad class of gradient-based learning algorithms to time
series spaces.


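Example 1 (Elastic Euclidean Distance) Let $Y \in \mathcal{X}$. Consider the function

$$D_Y : \mathcal{X} \to \mathbb{R}_+, \quad X \mapsto \|X - Y\|.$$

Then

$$\delta_Y : \mathcal{T} \to \mathbb{R}_+, \quad x \mapsto \min_{\phi \in \mathcal{P}(x)} \|x \otimes_\phi Y - Y\|$$

is an elastic function of $D_Y$. To see this, observe that from

$$\delta_Y(x) = \min_{\phi \in \mathcal{P}(x)} \|x \otimes_\phi Y - Y\| = \min_{\phi \in \mathcal{P}(x)} D_Y(x \otimes_\phi Y)$$

follows $\delta_Y(x) \in \mathcal{R}(\delta_Y, x) = \{D_Y(x \otimes_\phi Y) : \phi \in \mathcal{P}(x)\}$.
See Figure 3 for an illustration. We call $\delta_Y$ elastic Euclidean distance with
parameter $Y$. ■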

Figure 3: Elastic Euclidean distance $\delta_Y(x)$. From left to right: time series
$x = (2, 4, 3, 1)$, matrix $Y$, matrix $x \otimes_\phi Y$ obtained by embedding $x$
into matrix $Y$ along optimal warping path $\phi$, and distance computation by
aggregating the local costs giving $\delta_Y(x) = \sqrt{19}$. The optimal path is
highlighted in orange in $Y$ and in $x \otimes_\phi Y$. Gray shaded areas in both
matrices refer to parts that are not used, because the length $k = 4$ of $x$ is less
than $n = 7$. Since $x$ is embedded into $Y$, only elements lying on the path $\phi$
contribute to the distance. All other local costs between elements of $Y$ and
$x \otimes_\phi Y$ are zero.


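Example 2 (Elastic Inner Product) Let $W \in \mathcal{X}$. Consider the function

$$S_W : \mathcal{X} \to \mathbb{R}, \quad X \mapsto \langle X, W \rangle.$$

Then the function

$$\sigma_W : \mathcal{T} \to \mathbb{R}, \quad x \mapsto \max_{\phi \in \mathcal{P}(x)} \langle x \otimes_\phi 0, W \rangle$$

is an elastic function of $S_W$, called elastic inner product with parameter $W$. ■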

The elastic Euclidean distance and elastic inner product are elastic proximities
closely related to the DTW distance, where the elastic Euclidean distance
generalizes the DTW distance. The time and space complexity of both elastic proximities
is $O(nm)$. If no optimal warping path is required, space complexity can be reduced to
$O(\max(n, m))$. To see this, we refer to Algorithm 1. To obtain an optimal warping
path, we can trace back along the score matrix $S$ in the usual way. The procedure in
Algorithm 1 applies exactly the same dynamic programming scheme as the one for the
standard DTW distance and therefore has the same time and space complexity.

Observe that both elastic proximities embed time series into different matrices.
Elastic Euclidean distances embed time series into the parameter matrix, whereas
elastic inner products always embed time series into the zero-matrix $0$.

Algorithm 1 (Elastic Inner Product)

Input:
    - time series $x = (x_1, \ldots, x_k)$ with $k \leq n$
    - elasticity $m$
    - weight matrix $W = (w_{ij}) \in \mathbb{R}^{n \times m}$

Procedure:
    Let $S = (s_{ij}) \in \mathbb{R}^{k \times m}$ be the initial score matrix
    $s_{11} \leftarrow x_1 w_{11}$
    for $i = 2$ to $k$ do
        $s_{i1} \leftarrow s_{i-1,1} + x_i w_{i1}$
    for $j = 2$ to $m$ do
        $s_{1j} \leftarrow s_{1,j-1} + x_1 w_{1j}$
    for $i = 2$ to $k$ do
        for $j = 2$ to $m$ do
            $s_{ij} \leftarrow x_i w_{ij} + \max\{s_{i-1,j}, s_{i,j-1}, s_{i-1,j-1}\}$

Return:
    - $\sigma_W(x) = s_{km}$

Remark: This algorithm can also be used to compute elastic Euclidean distances. For
this, replace all products $x_i w_{ij}$ by squared costs $(x_i - w_{ij})^2$ and the
max-operation by a min-operation.
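
A direct transcription of Algorithm 1 into Python might look as follows. This is a sketch rather than a reference implementation: the helper `_elastic_score` and the two public function names are hypothetical, matrices are lists of rows, and the distance variant takes the square root of the final score on the assumption that the elastic Euclidean distance is the norm, not the squared norm, of the difference of the two matrices.

```python
import math


def _elastic_score(x, M, local, best):
    """Dynamic program of Algorithm 1 with a pluggable local term and max/min."""
    k, m = len(x), len(M[0])
    S = [[0.0] * m for _ in range(k)]          # score matrix of shape k x m
    S[0][0] = local(x[0], M[0][0])
    for i in range(1, k):                      # first column
        S[i][0] = S[i - 1][0] + local(x[i], M[i][0])
    for j in range(1, m):                      # first row
        S[0][j] = S[0][j - 1] + local(x[0], M[0][j])
    for i in range(1, k):
        for j in range(1, m):
            S[i][j] = local(x[i], M[i][j]) + best(
                S[i - 1][j], S[i][j - 1], S[i - 1][j - 1])
    return S[k - 1][m - 1]


def elastic_inner_product(x, W):
    """Maximum over warping paths of the sum of x_i * w_ij along the path."""
    return _elastic_score(x, W, lambda a, w: a * w, max)


def elastic_euclidean_distance(x, Y):
    """Square root of the minimal sum of (x_i - y_ij)^2 along a warping path."""
    return math.sqrt(_elastic_score(x, Y, lambda a, y: (a - y) ** 2, min))
```

Only the first `k` rows of the matrix are read, so the same weight matrix with `n` rows can be used for all series of length at most `n`.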


Example 3 (Elastic Linear Function) Let $\Theta = \mathcal{X} \times \mathbb{R}$ be a
set of parameters and let $\theta = (W, b) \in \Theta$ be a parameter. Consider the
linear function

$$F_\theta : \mathcal{X} \to \mathbb{R}, \quad X \mapsto b + S_W(X) = b + \langle X, W \rangle,$$

where $W$ is the weight matrix and $b$ is the bias. The function

$$f_\theta : \mathcal{T} \to \mathbb{R}, \quad x \mapsto b + \sigma_W(x)$$

is an elastic function of $F_\theta$, called elastic linear function. ■

Example 4 (Single-Layer Neural Network) Let
$\Theta = \mathcal{X}^r \times \mathbb{R}^{2r+1}$ be a set of parameters. Consider the
function

$$f_\theta : \mathcal{T} \to \mathbb{R}, \quad x \mapsto b + \sum_{i=1}^{r} w_i \, a(f_i(x)),$$

where $a(z)$ is a sigmoid function, $f_i = f_{\theta_i}$ are elastic linear functions
with parameters $\theta_i = (W_i, b_i)$, and
$\theta = (\theta_1, \ldots, \theta_r, w_1, \ldots, w_r, b)$. The function $f_\theta$
implements an elastic neural network for time series with $r$ sigmoid units in the
hidden layer and a single linear unit in the output layer. ■
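
3.3. Supervised Generalized Gradient Learning

This section introduces a generic scheme of generalized gradient learning for time series
under dynamic time warping.

Let $\Theta = \mathcal{X}^r \times \mathbb{R}^s$ be a set of parameters. Consider a
hypothesis space $\mathcal{F}$ of functions $f_\theta : \mathcal{T} \to \mathcal{Y}$
with parameter $\theta = (W_1, \ldots, W_r, b) \in \Theta$. Suppose that
$\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subseteq \mathcal{T} \times \mathcal{Y}$
is a training set. According to the empirical risk minimization principle, the goal is
to minimize

$$R_N[\theta] = R_N[f_\theta] = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f_\theta(x_i))$$

as a function of $\theta$. Since $R_N$ is a function of $\theta$, we rewrite the loss by
interchanging the role of argument $z = (x, y)$ and parameter $\theta$ such that

$$\ell_z : \Theta \to \mathbb{R}, \quad \theta \mapsto \ell(y, f_\theta(x)). \tag{4}$$

We assume that the loss $\ell_z$ is piecewise smooth with representation set

$$\mathcal{R}(\ell_z) = \{\ell_\Phi : \Theta \to \mathbb{R} : \Phi = (\phi_1, \ldots, \phi_r) \in \mathcal{P}^r(x)\}$$

indexed by $r$-tuples of warping paths from $\mathcal{P}(x)$. The gradient
$\nabla \ell_\Phi$ of an active function $\ell_\Phi$ at $\theta$ is given by

$$\nabla \ell_\Phi = \left( \frac{\partial \ell_\Phi}{\partial W_1}, \ldots, \frac{\partial \ell_\Phi}{\partial W_r}, \frac{\partial \ell_\Phi}{\partial b} \right),$$

where $\partial \ell_\Phi / \partial \theta_i$ denotes the partial derivative of
$\ell_\Phi$ with respect to $\theta_i$. The incremental update rule of the generalized
gradient method is of the form

$$W_i^{t+1} = W_i^t - \eta^t \cdot \frac{\partial \ell_\Phi}{\partial W_i}(\theta^t) \tag{5}$$

$$b^{t+1} = b^t - \eta^t \cdot \frac{\partial \ell_\Phi}{\partial b}(\theta^t) \tag{6}$$

for all $i \in [r]$. Section 3.6 discusses consistency of variants of update rule (5)
and (6).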



3.4. Elastic Linear Classifiers 


Let $\mathcal{Y} = \{\pm 1\}$ be the output space consisting of two class labels. An
elastic linear classifier is a function of the form

$$h_\theta : \mathcal{T} \to \mathcal{Y}, \quad x \mapsto \begin{cases} +1 & : f_\theta(x) \geq 0 \\ -1 & : f_\theta(x) < 0 \end{cases} \tag{7}$$

where $f_\theta(x) = b + \sigma_W(x)$ is an elastic linear function and
$\theta = (W, b)$ summarizes the parameters. We assign a time series $x$ to the
positive class if $f_\theta(x) \geq 0$ and to the negative class otherwise.

Elastic Logistic Regression
    output space        $\mathcal{Y} = \{0, 1\}$
    logistic function   $g_\theta(x) = 1 / (1 + \exp(-f_\theta(x)))$
    loss function       $\ell = -y \log(g_\theta(x)) - (1 - y) \log(1 - g_\theta(x))$
    partial derivative  $\partial_W \ell = -(y - g_\theta(x)) \cdot X$

Elastic Perceptron
    output space        $\mathcal{Y} = \{\pm 1\}$
    loss function       $\ell = \max\{0, -y \cdot f_\theta(x)\}$
    partial derivative  $\partial_W \ell = -y \cdot X \cdot \mathbb{I}\{\ell > 0\}$

Elastic Margin Perceptron
    output space        $\mathcal{Y} = \{\pm 1\}$
    loss function       $\ell = \max\{0, \xi - y \cdot f_\theta(x)\}$
    partial derivative  $\partial_W \ell = -y \cdot X \cdot \mathbb{I}\{\ell > 0\}$

Elastic Linear SVM
    output space        $\mathcal{Y} = \{\pm 1\}$
    loss function       $\ell = \lambda \|W\|^2 + \max\{0, 1 - y \cdot f_\theta(x)\}$
    partial derivative  $\partial_W \ell = 2 \lambda W - y \cdot X \cdot \mathbb{I}\{1 - y \cdot f_\theta(x) > 0\}$

Table 1: Examples of elastic linear classifiers. By $\partial_W \ell$ we denote a
partial derivative of an active function of $\ell$ with respect to $W$. The partial
derivatives $\partial_b \ell$ coincide with their corresponding counterparts in vector
spaces and are therefore not included. The matrix $X = x \otimes_\phi 0$ is obtained
by embedding time series $x$ into the zero-matrix $0$ along active warping path
$\phi$. The indicator function $\mathbb{I}\{z\}$ returns 1 if the boolean expression
$z$ is true and returns 0 otherwise. The elastic perceptron is a special case of the
elastic margin perceptron with margin $\xi = 0$. The elastic linear SVM can be
regarded as a special $\ell_2$-regularized elastic margin perceptron with margin
$\xi = 1$.


Depending on the choice of loss function $\ell(y, f_\theta(x))$, we obtain different
elastic linear classifiers as shown in Table 1. The loss function of elastic logistic
regression is differentiable as a function of $f_\theta$ and $b$, but piecewise smooth
as a function of $W$. All other loss functions are piecewise smooth as a function of
$f_\theta$, $b$ and $W$.

From the partial derivatives, we can construct the update rule of the generalized
gradient method. For example, the incremental / stochastic update rule of the elastic
perceptron is of the form

$$W^{t+1} = W^t + \eta^t \, y X \tag{8}$$

$$b^{t+1} = b^t + \eta^t \, y, \tag{9}$$

where $(x, y)$ is the training example at iteration $t$, and $X = x \otimes_\phi 0$
with $\phi \in \mathcal{A}(\ell, x)$. From the factor $\mathbb{I}\{\ell > 0\}$ shown in
Table 1 it follows that the update rule given in (8) and (9) is only applied when $x$
is misclassified.
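
The update rule can be illustrated with a small training loop for the elastic margin perceptron, which reduces to the elastic perceptron when the margin is zero. This is a sketch under several assumptions that are not taken from the paper: function and parameter names are hypothetical, warping paths use 0-based indices, the active path is recovered by tracing back through the score matrix, ties are broken arbitrarily, and training stops after a full pass without updates.

```python
def elastic_inner_product_with_path(x, W):
    """Return the elastic inner product and one maximizing (active) warping path."""
    k, m = len(x), len(W[0])
    S = [[0.0] * m for _ in range(k)]
    for i in range(k):
        for j in range(m):
            prev = [S[a][b] for a, b in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                    if a >= 0 and b >= 0]
            S[i][j] = x[i] * W[i][j] + (max(prev) if prev else 0.0)
    # Trace back from (k-1, m-1) to (0, 0) along best-scoring predecessors.
    i, j = k - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        cands = [(a, b) for a, b in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                 if a >= 0 and b >= 0]
        i, j = max(cands, key=lambda t: S[t[0]][t[1]])
        path.append((i, j))
    return S[k - 1][m - 1], path[::-1]


def train_elastic_margin_perceptron(data, n, m, eta=0.1, margin=1.0, epochs=100):
    """data: list of (time series, label in {+1, -1}); returns (W, b)."""
    W = [[0.0] * m for _ in range(n)]
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in data:
            score, path = elastic_inner_product_with_path(x, W)
            if margin - y * (b + score) > 0:       # loss is positive: update
                for i, j in path:                  # X is zero off the active path
                    W[i][j] += eta * y * x[i]
                b += eta * y
                updated = True
        if not updated:
            break
    return W, b


toy_data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1)]
W_toy, b_toy = train_elastic_margin_perceptron(toy_data, n=2, m=2, eta=0.5, margin=1.0)
```

Because the embedded matrix is zero everywhere except on the active path, the weight update only touches the entries of `W` that lie on that path.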

We present three convergence results. A proof is given in Appendix A.

Convergence of the generalized gradient method. The generalized gradient method for
minimizing the empirical risk of an elastic linear classifier with convex loss converges
to a local minimum under the assumptions of [5], Theorem 4.1.

Convergence of the stochastic generalized gradient method. This method converges to
a local minimum of the expected risk of an elastic linear classifier with convex loss
under the assumptions of [5], Theorem 5.1.

Elastic margin perceptron convergence theorem. The perceptron convergence theorem
states that the perceptron algorithm with constant learning rate finds a separating
hyperplane whenever the training patterns are linearly separable. A similar result
holds for the elastic margin perceptron algorithm.

A finite training set $\mathcal{D} \subseteq \mathcal{T} \times \mathcal{Y}$ is
elastic-linearly separable if there are parameters $\theta = (W, b)$ such that
$h_\theta(x) = y$ for all examples $(x, y) \in \mathcal{D}$. We say $\mathcal{D}$ is
elastic-linearly separable with margin $\xi > 0$ if

$$\min_{(x, y) \in \mathcal{D}} y \, (b + \sigma_W(x)) \geq \xi.$$

Then the following convergence theorem holds:

Theorem 1 (Elastic Margin Perceptron Convergence Theorem) Suppose that
$\mathcal{D} \subseteq \mathcal{T} \times \mathcal{Y}$ is elastic-linearly separable
with margin $\xi > 0$. Then the elastic margin perceptron algorithm with fixed learning
rate $\eta$ and margin-parameter $\lambda \leq \xi$ converges to a solution $(W, b)$
that correctly classifies the training examples from $\mathcal{D}$ after a finite
number of update steps, provided the learning rate is chosen sufficiently small.

3.5. Unsupervised Generalized Gradient Learning 

Several unsupervised learning algorithms such as, for example, k-means, self-organizing
maps, principal component analysis, and mixture of Gaussians are based on the
concept of (weighted) mean. Once we know how to average a set of time series, extension
of mean-based learning methods to time series follows the same rules as for vectors.
Therefore, it is sufficient to focus on the problem of averaging a set of time series.

Suppose that $\mathcal{D} = \{x_1, \ldots, x_N\} \subseteq \mathcal{T}$ is a set of
unlabeled time series. Consider the sum of squared distances

$$F(Y) = \sum_{i=1}^{N} \min \left\{ \|x_i \otimes_{\phi_i} Y - Y\|^2 : \phi_i \in \mathcal{P}(x_i) \right\}. \tag{10}$$

A matrix $Y_*$ that minimizes $F$ is a mean of the set $\mathcal{D}$ and the minimum
value $F_* = F(Y_*)$ is the variation of $\mathcal{D}$. The update rule of the
generalized gradient method is of the form

$$Y^{t+1} = Y^t - \eta^t \sum_{i=1}^{N} \left( Y^t - X_i \right), \tag{11}$$

where $X_i = x_i \otimes_{\phi_i} Y^t$ is the matrix obtained by embedding the $i$-th
training example $x_i$ into matrix $Y^t$ along active warping path $\phi_i$, and the
constant factor of the gradient is absorbed into the learning rate. Under the
conditions of [5], Theorem 4.1, the generalized gradient method for minimizing $F$
using update rule (11) is consistent in the mean and variation.

We consider the special case when the learning rate is constant and takes the form
$\eta^t = 1/N$ for all $t \geq 0$. Then update rule (11) is equivalent to

$$Y^{t+1} = \frac{1}{N} \sum_{i=1}^{N} X_i, \tag{12}$$

where $X_i$ is as in (11).
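
One step of the averaging rule with constant learning rate 1/N, that is, replacing the current matrix by the average of the embedded training examples, can be sketched as follows. The names `optimal_path` and `mean_update` are hypothetical, paths use 0-based indices, and ties between equally good warping paths are broken arbitrarily; none of these choices is prescribed by the text.

```python
def optimal_path(x, Y):
    """One optimal (active) warping path of x with respect to Y for squared costs."""
    k, m = len(x), len(Y[0])
    S = [[0.0] * m for _ in range(k)]
    for i in range(k):
        for j in range(m):
            prev = [S[a][b] for a, b in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                    if a >= 0 and b >= 0]
            S[i][j] = (x[i] - Y[i][j]) ** 2 + (min(prev) if prev else 0.0)
    i, j = k - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        cands = [(a, b) for a, b in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                 if a >= 0 and b >= 0]
        i, j = min(cands, key=lambda t: S[t[0]][t[1]])
        path.append((i, j))
    return path[::-1]


def mean_update(series, Y):
    """Return the average of the matrices X_i obtained by embedding each series into Y."""
    n, m = len(Y), len(Y[0])
    total = [[0.0] * m for _ in range(n)]
    for x in series:
        X = [row[:] for row in Y]              # entries off the path stay equal to Y
        for i, j in optimal_path(x, Y):
            X[i][j] = x[i]
        for i in range(n):
            for j in range(m):
                total[i][j] += X[i][j]
    return [[v / len(series) for v in row] for row in total]
```

Iterating `Y = mean_update(series, Y)` until `Y` stops changing gives a simple fixed-point procedure; whether and where it settles depends on the initial matrix.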


3.6. A Note on Convergence and Consistency 

Gradient-based methods in statistical pattern recognition typically assume that the
functions of the underlying hypothesis space are differentiable. However, many loss
functions in machine learning are piecewise smooth, such as, for example, the loss of
perceptron learning, k-means, and loss functions using $\ell_1$-regularization. This
case has been discussed and analyzed by [2].

When learning in elastic spaces, hypothesis spaces consist of piecewise smooth
functions, which are pullbacks of smooth functions. Since piecewise smooth functions are
closed under composition, the situation is similar to standard pattern recognition,
where hypothesis spaces consist of smooth functions. What has changed is that we will
have "more" non-smooth points. Nevertheless, the set of non-smooth points remains
negligible in the sense that it forms a set of Lebesgue measure zero.

Piecewise smooth functions are locally Lipschitz and therefore admit a Clarke
subdifferential $Df$ at each point [3]. The Clarke subdifferential $Df$ is a set whose
elements are called generalized gradients. At differentiable points, the Clarke
subdifferential coincides with the gradient, that is $Df(x) = \{\nabla f(x)\}$. A
necessary condition of optimality of $f$ at $x$ is $0 \in Df(x)$.

Using these and other concepts from non-smooth analysis, we can construct
minimization procedures that generalize gradient descent methods. In previous subsections,
we presented a slightly simpler variant of the following generalized gradient method:
Consider the minimization problem

$$\min_{x \in \mathcal{Z}} f(x), \tag{13}$$

where $f$ is a piecewise smooth function and $\mathcal{Z} \subseteq \mathcal{X}$ is a
bounded convex constraint set. Let $\mathcal{Z}_*$ denote the subset of solutions
satisfying the necessary condition of optimality and let
$f(\mathcal{Z}_*) = \{f(x) : x \in \mathcal{Z}_*\}$ be the set of solution values.
Consider the following iterative method:

$$x^0 \in \mathcal{Z} \tag{14}$$

$$x^{t+1} \in \Pi_{\mathcal{Z}}(x^t - \eta^t \cdot g^t), \tag{15}$$

where $g^t \in Df(x^t)$ is a generalized gradient of $f$ at $x^t$, $\Pi_{\mathcal{Z}}$
is the multi-valued projection onto $\mathcal{Z}$ and $\eta^t$ is the learning rate
satisfying the conditions

$$\lim_{t \to \infty} \eta^t = 0 \quad \text{and} \quad \sum_{t=0}^{\infty} \eta^t = \infty. \tag{16}$$

The generalized gradient method (14)-(16) minimizes a piecewise smooth function $f$ by
selecting a generalized gradient, performing the usual update step, and then projecting
the updated point onto the constraint set $\mathcal{Z}$. If $f$ is differentiable at
$x^t$, which is almost always the case, then the update amounts to selecting an active
index $i \in \mathcal{A}(f, x^t)$ of $f$ at the current iterate $x^t$ and then
performing gradient descent along direction $-\nabla f_i(x^t)$.

Note that the constraint set $\mathcal{Z}$ has been ignored in previous subsections. We
introduce a sufficiently large constraint set $\mathcal{Z}$ to ensure convergence. In a
practical setting, we may ignore specifying $\mathcal{Z}$ unless the sequence $(x^t)$
accidentally goes to infinity.

Under mild additional assumptions, this procedure converges to a solution satisfying
the necessary condition of optimality [5], Theorem 4.1: The sequence $(x^t)$ generated
by method (14)-(16) converges to the solution of problem (13) in the following sense:

1. the limit points $\bar{x}$ of $(x^t)$ with minimum value $f(\bar{x})$ are contained in $\mathcal{Z}_*$.

2. the limit points $\bar{f}$ of $(f(x^t))$ are contained in $f(\mathcal{Z}_*)$.

Consistency of the stochastic generalized gradient method for minimizing the
expected risk functional follows from [5], Theorem 5.1, provided similar assumptions are
satisfied.
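
The iterative scheme of this section can be sketched for a toy problem. The code below is an illustration only and makes several assumptions that do not come from the text: the objective is the pointwise maximum of smooth pieces, the constraint set is a Euclidean ball around the origin, the learning rate is 1/(t+1) (which tends to zero and has a divergent sum), and all function names are hypothetical.

```python
import math


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius around the origin."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def generalized_gradient_descent(pieces, x0, radius, steps):
    """Minimize f = max_i f_i over a ball; pieces is a list of (f_i, grad_f_i)."""
    x = project_ball(x0, radius)
    for t in range(steps):
        values = [f(x) for f, _ in pieces]
        i = max(range(len(pieces)), key=values.__getitem__)   # an active index
        g = pieces[i][1](x)                                   # an active gradient
        eta = 1.0 / (t + 1)
        x = project_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return x


# Toy objective: f(x) = max((x0 - 1)^2, (x0 + 1)^2) + x1^2, non-smooth along x0 = 0.
toy_pieces = [
    (lambda x: (x[0] - 1) ** 2 + x[1] ** 2, lambda x: [2 * (x[0] - 1), 2 * x[1]]),
    (lambda x: (x[0] + 1) ** 2 + x[1] ** 2, lambda x: [2 * (x[0] + 1), 2 * x[1]]),
]
x_final = generalized_gradient_descent(toy_pieces, [3.0, -2.0], radius=10.0, steps=2000)
```

In this toy problem the iterates oscillate around the kink at the origin with shrinking amplitude, which reflects why a decreasing learning rate is required for non-smooth objectives.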


3.7. Generalizations 

This section indicates some generalizations of the concept of elastic functions. 

3.7.1. Generalization to other Elastic Distance Functions 

Elastic functions as introduced here are based on the DTW distance via embeddings 
along a set of feasible warping paths with squared differences as local transformation 
costs. The choice of distance function and local transformation cost is arbitrary. We 
can equally well define elastic functions based on proximities other than the DTW 
distance. Results on learning carry over whenever a proximity p on time series satisfies 
the following sufficient conditions: (1) p minimizes the costs over a set of feasible paths, 
(2) the cost of a feasible path is a piecewise smooth function as a function of the local 
transformation costs, and (3) the local transformation costs are piecewise smooth. 

With regard to the DTW distance, these generalizations include the Euclidean distance
and DTW distances with additional constraints such as the Sakoe-Chiba band
[23]. Furthermore, absolute differences as local transformation cost are feasible,
because the absolute value function is piecewise smooth.
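
The variants mentioned above can be sketched with a single dynamic program that takes the local transformation cost and an optional band width as parameters. This is a minimal illustration under assumptions: the function name `dtw_cost`, the step pattern (diagonal, vertical, horizontal), and the band definition $|i - j| \le$ `band` are choices made for this sketch and may differ from the definitions used elsewhere in the paper.

```python
import math

def dtw_cost(x, y, local_cost=lambda a, b: (a - b) ** 2, band=None):
    """Minimal accumulated cost over warping paths between sequences x and y.

    local_cost: cost of aligning two elements (squared difference by default).
    band: optional Sakoe-Chiba style constraint |i - j| <= band.
    """
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue                      # cell lies outside the band
            best_prev = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
            D[i][j] = local_cost(x[i - 1], y[j - 1]) + best_prev
    return D[n][m]

abs_cost = lambda a, b: abs(a - b)
x, y = [0, 1, 2], [1, 2, 3]
unconstrained = dtw_cost(x, y, local_cost=abs_cost)        # warping allowed
diagonal_only = dtw_cost(x, y, local_cost=abs_cost, band=0)  # no warping
```

With `band=0` and series of equal length, only the diagonal path is feasible, so the squared-difference cost reduces to the squared Euclidean distance.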


3.7.2. Generalization to Multivariate Time Series 

A multivariate time series is an ordered sequence $x = (x_1, \ldots, x_n)$ consisting of feature
vectors $x_i \in \mathbb{R}^d$. We can define the DTW distance between multivariate time series $x$
and $y$ as in the univariate case but replace the local transformation cost
$c(x_i, y_j) = (x_i - y_j)^2$ by $c(x_i, y_j) = \|x_i - y_j\|^2$.

To define elastic functions, we embed multivariate time series into the set
$\mathcal{X} = (\mathbb{R}^d)^{n \times m}$ of vector-valued matrices $X = (x_{ij})$ with elements $x_{ij} \in \mathbb{R}^d$.
These adjustments preserve piecewise smoothness, because the Euclidean space $\mathcal{X}$ is a direct
product of lower-dimensional Euclidean spaces.
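
A small sketch of the multivariate local cost in use: the only change relative to the univariate dynamic program is that each cell holds a squared Euclidean norm between feature vectors. The function name `multivariate_dtw_sq` and the NumPy-based layout (one row per time step) are assumptions made for this illustration.

```python
import numpy as np

def multivariate_dtw_sq(x, y):
    """Squared DTW distance with local cost ||x_i - y_j||^2.

    x and y are sequences of d-dimensional feature vectors; a plain list of
    scalars is treated as the case d = 1.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)   # shape (n, d)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)   # shape (m, d)
    # cost[i, j] = squared Euclidean distance between x_i and y_j
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j],
                                               D[i, j - 1])
    return float(D[n, m])

d_2d = multivariate_dtw_sq([[0, 0], [1, 0]], [[0, 1], [1, 1]])
d_1d = multivariate_dtw_sq([0, 1, 2], [1, 2, 3])
```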


3.7.3. Generalization to Sequences with Symbolic Attributes 

We consider sequences $x = (x_1, \ldots, x_n)$ with attributes $x_i$ from some finite set $\mathcal{A}$
of $d$ attributes (symbols). Since $\mathcal{A}$ is finite, we can represent its attributes $a \in \mathcal{A}$
by $d$-dimensional binary vectors $\mathbf{a} \in \{0,1\}^d$, where all but one element is zero. The
unique non-zero element has value one and is related to attribute $a$. In doing so, we
can reduce the case of attributed sequences to the case of multivariate time series.

We can introduce the following local transformation costs

$$c(x_i, y_j) = \begin{cases} 0 & : x_i = y_j \\ 1 & : x_i \neq y_j \end{cases}$$


More generally, we can define local transformation costs of the form 


$$c(x_i, y_j) = k(x_i, x_i) - 2\,k(x_i, y_j) + k(y_j, y_j),$$

where $k : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ is a positive-definite kernel. Provided that the kernel is an inner
product in some finite-dimensional feature space, we can also reduce this generalization
to the case of multivariate time series.
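
The relation between the binary encoding and the kernel-induced cost can be checked on a toy alphabet. The sketch below is illustrative only: the alphabet, the helper names `one_hot` and `kernel_cost`, and the two kernels are assumptions. With the plain inner product on one-hot vectors the cost is 0 for equal and 2 for different symbols; scaling the kernel by 1/2 yields the 0/1 cost.

```python
import numpy as np

ALPHABET = ["A", "C", "G", "T"]          # hypothetical finite attribute set

def one_hot(symbol):
    """d-dimensional binary vector with a single one at the symbol's index."""
    v = np.zeros(len(ALPHABET))
    v[ALPHABET.index(symbol)] = 1.0
    return v

def kernel_cost(a, b, k):
    """Local transformation cost c(a, b) = k(a, a) - 2 k(a, b) + k(b, b)."""
    return k(a, a) - 2.0 * k(a, b) + k(b, b)

# Inner product of the one-hot encodings, and the same kernel scaled by 1/2.
linear = lambda a, b: float(np.dot(one_hot(a), one_hot(b)))
half_linear = lambda a, b: 0.5 * linear(a, b)

cost_same = kernel_cost("A", "A", linear)
cost_diff = kernel_cost("A", "C", linear)
zero_one_diff = kernel_cost("A", "C", half_linear)
```

For the inner-product kernel, `kernel_cost` coincides with the squared Euclidean distance between the one-hot vectors, which is the multivariate local cost.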


4. Relationship to Previous Approaches 

Previous work on adaptive methods either focuses on computing, or is based on, a
concept of (weighted) mean of a set of time series. Most of the literature is summarized
in nn nausea. To place those approaches into the framework of elastic functions, 
it is sufficient to consider the problem of computing a mean of a set of time series. 

Suppose that $\mathcal{D} = \{x_1, \ldots, x_N\}$ is a set of time series. A mean is any time series $y^*$
that minimizes the sum of squared DTW distances

$$f(y) = \sum_{i=1}^{N} d^2(x_i, y).$$

Algorithm 2 outlines a unifying minimization procedure of $f$. The set $Z$ in line 1 of
the procedure consists of all matrices with $n$ identical rows, where $n$ is the maximum
length of all time series from $\mathcal{D}$. Thus, there is a one-to-one correspondence between
time series from $\mathcal{T}$ and matrices from the subset $Z$. By construction, we have
$f(y) = F(Y)$, where $Y \in Z$ is the matrix with all rows equal to $y$ and $F(Y)$ is as defined in
eq. (10).

Algorithm 2 (Mean Computation)

Input:

- sample $\mathcal{D} = \{x_1, \ldots, x_N\} \subseteq \mathcal{T}$

Procedure:

1. initialize $Y \in Z$

2. repeat

2.1. determine active warping paths $\phi_i$ that embed $x_i$ into $Y$

2.2. update $Y' \leftarrow \nu(Y, x_1, \ldots, x_N, \phi_1, \ldots, \phi_N)$

2.3. project $Y \leftarrow \pi(Y')$ to $Z$
until convergence

Return:

- approximation $y$ of mean


In line 2.1, we determine active warping paths of the function $F(Y)$ that embed $x_i$
into matrix $Y$. By construction, this step is equivalent to computing optimal warping
paths for determining the DTW distance between $x_i$ and $y$. Line 2.2 updates matrix
$Y$ and line 2.3 projects the updated matrix $Y$ to the set $Z$. The last step is equivalent
to constructing a time series from a matrix.

Previous approaches differ in the form of the update rule $\nu$ in line 2.2 and the projection
$\pi$ in line 2.3. Algorithmically, steps 2.2 and 2.3 usually form a single step in the
sense that the composition $\psi = \pi \circ \nu$ cannot be decomposed into two separate
processing steps as clearly as described in Algorithm 2. The choice of $\nu$ and $\pi$ is critical for
convergence analysis. Problems arise when the map $\nu$ does not select a generalized
gradient and the projection $\pi$ does not map a matrix from $\mathcal{X}$ to a closest matrix from
$Z$. In these cases, it may be unclear how to define necessary conditions of optimality
for the function $f$. As a consequence, even if steps 2.2 and 2.3 minimize $f$, we do not
know whether Algorithm 2 converges to a local minimum of $f$. The same problems
arise when studying the asymptotic properties of the mean as a minimizer of $f$.

The situation is different for the function $F$ defined in eq. (10). When minimizing
$F$, the set $Z$ coincides with $\mathcal{X}$. Since the function $F$ is piecewise smooth, the map
$\nu$ in line 2.2 corresponds to an update step of the generalized gradient method. The
projection $\pi$ in line 2.3 is the identity. Under the conditions of [5], Theorem 4.1 and
Theorem 5.1, the procedure described in Algorithm 2 is consistent.
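
To make the structure of Algorithm 2 concrete, the following sketch runs one possible instance of it directly on the time series $y$: the update is a generalized gradient step on $f$ along the current optimal warping paths, and no separate projection is needed because the iterate stays a time series of fixed length. This is an illustration under assumptions, not the matrix-space procedure for $F$ and not any of the previous approaches; the function names, the constant step size, the initialization with the first sample, and the best-iterate bookkeeping are choices made for this sketch.

```python
import numpy as np

def dtw_path(x, y):
    """Squared DTW distance and one optimal warping path (0-based index pairs)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                 # backtrack from (n, m) to (1, 1)
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def sum_sq_dtw(sample, y):
    """Objective f(y): sum of squared DTW distances to all sample series."""
    return sum(dtw_path(x, y)[0] for x in sample)

def subgradient_mean(sample, steps=30, eta=0.05):
    y = np.array(sample[0], dtype=float)           # line 1: initialize
    best_y, best_f = y.copy(), sum_sq_dtw(sample, y)
    for _ in range(steps):
        grad = np.zeros_like(y)
        for x in sample:
            _, path = dtw_path(x, y)               # line 2.1: active path
            for i, j in path:
                grad[j] += 2.0 * (y[j] - x[i])     # gradient of the active piece
        y = y - eta * grad                         # line 2.2: update step
        f = sum_sq_dtw(sample, y)
        if f < best_f:                             # keep the best iterate seen
            best_y, best_f = y.copy(), f
    return best_y

sample = [[0, 1, 2, 1, 0], [0, 2, 4, 2, 0]]
y_hat = subgradient_mean(sample)
```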

Dataset                  #(Train)   #(Test)   Length   $\rho$
Coffee                         28        28      286    0.098
ECG200                        100       100       96    1.042
ECGFiveDays                    23       861      136    0.169
Gun Point                      50       150      150    0.333
ItalyPowerDemand               67     1,029       24    2.792
Lightning 2                    60        61      637    0.094
MoteStrain                     20     1,252       84    0.238
SonyAIBORobotSurface           20       601       70    0.286
SonyAIBORobotSurfaceII         27       953       65    0.415
TwoLeadECG                     23     1,139       82    0.280
Wafer                       1,000     6,174      152    6.579
Yoga                          300     3,000      426    0.704

Table 2: Characteristic features of data sets for two-class classification problems. The
last column shows the ratio $\rho$ = #(train)/length.


5. Experiments 

The goal of this section is to assess the performance and behavior of elastic linear classifiers.
We present and discuss results from two experimental studies. The first study
explores the effects of the elasticity parameter on the error rate and the second study
compares the performance of different elastic linear classifiers. We considered two-class
problems of the UCR time series datasets [10]. Table 2 summarizes characteristic
features of the datasets.

5.1. Exploring the Effects of Elasticity 

The first experimental study explores the effects of elasticity on the error rate by 
controlling the number of columns of the weight matrix of an elastic perceptron. 

5.1.1. Experimental Setup. 

The elastic perceptron algorithm was applied to the Gun Point, ECG200, and ECGFiveDays
datasets using the following setting: The dimension of the matrix space $\mathcal{X}$
was set to $n \times m$, where $n$ is the length of the longest time series in the training set
of the respective dataset. Bias and weight matrix were initialized by drawing random
numbers from the uniform distribution on the interval $[-0.01, +0.01]$. The elasticity
$m$ was controlled via the ratio $w = m/n$. For every $w \in S_w$ the learning rate $\eta \in S_\eta$
with the lowest error on the training set was selected, where the sets are of the form

$$S_w = \{0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.0, 3.0\}$$

$$S_\eta = \{1.0, 0.7, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001\}.$$

Note that the value w = 0 refers to m = 1. Thus the weight matrix collapses to a 
column vector and the elastic perceptron becomes the standard perceptron. To assess 
the generalization performance, the learned classifier was applied to the test set. The 
whole experiment was repeated 30 times for every value w. 
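
The selection protocol described above can be sketched as a small grid search. The sketch is hypothetical in several respects: `train_error(m, eta)` stands for a user-supplied routine that trains an elastic perceptron with $m$ columns and learning rate `eta` and returns its training error, and the rounding of $w \cdot n$ to an integer number of columns is an assumption, since the text only fixes $w = 0$ to mean $m = 1$.

```python
S_W = [0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.0, 3.0]
S_ETA = [1.0, 0.7, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]

def columns_from_ratio(w, n):
    """Number of columns m for ratio w = m/n; w = 0 is mapped to m = 1.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return max(1, round(w * n))

def select_learning_rates(train_error, n):
    """For every w in S_W, pick the eta in S_ETA with the lowest training error.

    train_error(m, eta) is a hypothetical callable supplied by the user.
    """
    best = {}
    for w in S_W:
        m = columns_from_ratio(w, n)
        best[w] = min(S_ETA, key=lambda eta: train_error(m, eta))
    return best

# Dummy error function for demonstration only: it prefers eta = 0.1.
best = select_learning_rates(lambda m, eta: abs(eta - 0.1), n=150)
```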

Figure 4: Mean error rates of elastic perceptron on Gun Point, ECG200, and ECGFiveDays.
Vertical axes show the mean error rates in % averaged over 30
trials. Horizontal axes show the ratio $w = m/n$, where $m$ is the elasticity,
that is the number of columns of the weight matrix and $n$ is the length of
the longest time series of the respective dataset. Ratio $w = 0$ means $m = 1$
and corresponds to the standard perceptron algorithm.


5.1.2. Results and Discussion 

Figure 4 shows the mean error rates of the elastic perceptron as a function of $w = m/n$.
The error rates on the respective training sets were always zero.

One characteristic feature of the UCR datasets listed in Table 2 is that the number
of training examples is low compared to the dimension of the time series. This explains
the low training error rates and the substantially higher test error rates.

The three plots show typical curves also observed when applying the elastic perceptron
to the other datasets listed in Table 2. The most important observation
is that the parameter $w$ is problem-dependent and needs to be selected carefully.
If the training set is small and dimensionality is high, a proper choice of $w$ becomes
challenging. The second observation is that in some cases, the standard perceptron
algorithm ($w = 0$) may perform best, as in ECGFiveDays. Increasing $w$ results in a
classifier with larger flexibility. Intuitively, this means that an elastic perceptron can
implement more decision boundaries the larger $w$ is. If $w$ becomes too large, the classifier
becomes more prone to overfitting, as indicated by the results on ECG200 and
ECGFiveDays. We hypothesize that elasticity controls the capacity of an elastic linear
classifier.


5.2. Comparative Study 

This comparative study assesses the performance of elastic linear classifiers. 


5.2.1. Experimental Setup. 

In this study, we used all datasets listed in Table 2. The four elastic linear classifiers
of Section 3.4 were compared against different variants of the nearest neighbor (NN)
classifier with DTW distance. The variants of the NN classifier differ in the choice
of prototypes. The first variant uses all training examples as prototypes (NN+ALL).
The second and third variants learn one prototype per class from the training set,
using k-means (NN+KME) and agglomerative hierarchical clustering
(NN+AHC), respectively [15].

The settings of the elastic linear classifiers were as follows: The dimension of the 
matrix space X was set to n x m, where n is the length of the longest time series in 
the training set and m = [n/10] is the elasticity. The elasticity m was set to 10% of 
the length n for the following reasons: First, m should be small to avoid overfitting 
due to high dimensionality of the data and small size of the training set. Second, m 
should be larger than one, because otherwise an elastic linear classifier reduces to a 
standard linear classifier. 

Bias and weight matrix were initialized by drawing random numbers from the uniform
distribution on the interval $[-0.01, +0.01]$. Parameters were selected by $k$-fold
cross validation on the training set of size $N$. We set $k = 10$ if $N > 30$ and $k = N$
otherwise. The following parameters were selected: learning rate $\eta$ for all elastic linear
classifiers, margin $\xi$ for elastic margin perceptron, and regularization parameter $\lambda$ for
elastic linear SVM. The parameters were selected from the following values

$$\eta \in \{2^{-10}, 2^{-9}, \ldots, 2^{0}\}, \quad \xi \in \{10^{-7}, 10^{-6}, \ldots, 10^{1}\}, \quad \lambda \in \{2^{-10}, 2^{-9}, \ldots, 2^{-1}\}.$$

The final model was obtained by training the elastic linear classifiers on the whole
training set using the optimal parameter(s). We assessed the generalization performance
by applying the learned model to the test data. Since the performance of
elastic linear classifiers depends on the random initialization of the bias and weight
matrix, we repeated the last two steps 100 times, using the same selected parameters
in each trial.
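
The parameter grids and the fold rule of this setup are easy to write down explicitly. The sketch below only enumerates the candidate values and the number of folds; the names and the interleaved assignment of examples to folds (without shuffling) are assumptions made for illustration, and the training routine itself is not shown.

```python
ETA_GRID = [2.0 ** e for e in range(-10, 1)]      # 2^-10, ..., 2^0
XI_GRID = [10.0 ** e for e in range(-7, 2)]       # 10^-7, ..., 10^1
LAMBDA_GRID = [2.0 ** e for e in range(-10, 0)]   # 2^-10, ..., 2^-1

def num_folds(n_train):
    """k = 10 if N > 30, otherwise k = N (leave-one-out)."""
    return 10 if n_train > 30 else n_train

def fold_indices(n_train):
    """Split indices 0..N-1 into k interleaved validation folds (assumed layout)."""
    k = num_folds(n_train)
    return [list(range(f, n_train, k)) for f in range(k)]

folds_coffee = fold_indices(28)    # Coffee has 28 training examples
folds_ecg200 = fold_indices(100)   # ECG200 has 100 training examples
```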

5.2.2. Results and Discussion. 

Table 3 summarizes the error rates of all elastic linear (EL) classifiers and nearest
neighbor (NN) classifiers. 

Comparison of EL classifiers and NN methods is motivated by the following reasons:
First, NN classifiers belong to the state-of-the-art and are considered to be exceptionally
difficult to beat. Second, in Euclidean spaces linear classifiers and nearest
neighbors are two simple but complementary approaches. Linear classifiers are computationally
efficient, make strong assumptions about the data and therefore may yield
stable but possibly inaccurate predictions. In contrast, nearest neighbor methods make
very mild assumptions about the data and therefore often yield accurate but possibly
unstable predictions [5].

The first key observation suggests that overall generalization performance of EL
classifiers is comparable to the state-of-the-art NN classifier. This observation is supported
by the same number of green shaded rows (EL is better) and red shaded
rows (NN is better) in Table 3. As reported by [12], ensemble classifiers of different
elastic distance measures are assumed to be the first approach that significantly outperformed
the NN+ALL classifier on the UCR time series datasets. This result is not
surprising, because in machine learning it has long been well known that ensemble
classifiers often perform better than their base classifiers for reasons explained in j^j. 

                     NN + DTW                    Elastic linear classifiers
Dataset           ALL    AHC    KME     ePERC        eLOGR        eMARG        eLSVM
Coffee           17.9   25.0   25.0    4.6 ± 3.0    4.5 ± 2.9    4.7 ± 3.3    3.1 ± 2.9
ECG200           23.0   28.0   28.0   13.6 ± 1.8   11.8 ± 1.6   13.6 ± 1.9   13.1 ± 1.7
ECGFiveDays      23.2   33.0   33.0   15.3 ± 3.7   15.7 ± 3.3   15.3 ± 3.4   11.1 ± 3.0
Italy PowDem.     5.0   21.5   21.5    3.8 ± 1.2    3.0 ± 0.3    3.5 ± 0.8    3.0 ± 0.3
Wafer             2.0   69.5   69.5    1.3 ± 0.3    1.2 ± 0.2    1.0 ± 0.2    1.0 ± 0.2
Gun Point         9.3   32.7   32.7    9.7 ± 3.6    9.2 ± 2.5   10.0 ± 3.4    9.0 ± 2.8
SonyAIBO II      27.5   21.6   21.6   27.0 ± 3.8   20.2 ± 1.4   26.6 ± 3.3   22.7 ± 2.1
Lighting 2       13.1   36.1   36.1   44.2         44.1         44.6         47.6
MoteStrain       16.5   13.3   13.3   17.2         16.0         17.6         15.8
SonyAIBO         16.9   18.8   18.8   19.3         18.6         19.5         17.8
TwoLeadECG        9.6   16.2   16.2   22.7         21.8         -            21.8
Yoga             16.4   45.8   45.8   20.9         21.5         -            20.8

Table 3: Mean error rates and standard deviations of elastic linear classifiers averaged
over 100 trials and error rates of nearest-neighbor classifiers using the DTW
distance (NN+DTW). ALL: NN+DTW with all training examples as prototypes;
AHC: NN+DTW with one prototype per class obtained from agglomerative
hierarchical clustering with Ward linkage; KME: NN+DTW with one
prototype per class obtained from k-means clustering; ePERC: elastic perceptron;
eLOGR: elastic logistic regression; eMARG: elastic margin perceptron;
eLSVM: elastic linear SVM. Best (avg.) results are highlighted. Green rows:
avg. results of all elastic linear classifiers are better than the results of all NN
classifiers. Yellow rows: results of elastic linear classifiers and NN classifiers
are comparable. Red rows: avg. results of all elastic linear classifiers are worse
than the best result of an NN classifier.


Since any base classifier can contribute to an ensemble classifier, it is feasible to restrict 
comparison to base classifiers such as the state-of-the-art NN+ALL classifier. 

The second key observation indicates that EL classifiers are clearly superior to NN
classifiers with one prototype per class, denoted by NN$_1$ henceforth. Evidence for this
finding is provided by two results: first, AHC and KME performed best among several
prototype selection methods for NN classification [15]; and second, error rates of EL
classifiers are significantly better than those of NN+AHC and NN+KME for eight,
comparable for two, and significantly worse for two datasets.

The third key observation is that EL classifiers offer a clearly better compromise between
solution quality and computation time than NN classifiers. Findings reported by [27]
indicate that more prototypes may improve generalization performance of NN classifiers.
At the same time, more prototypes increase computation time, though the
differences will decrease for larger numbers of prototypes by applying certain acceleration
techniques. At the extreme ends of the scale, we have NN+ALL and NN$_1$
classifiers. With respect to solution quality, the first two key observations state that EL
classifiers are comparable to the slowest NN classifiers using the whole training set as
prototypes and clearly superior to the fastest NN classifiers using one prototype per
class.

To compare computational efficiency, we first consider the case without applying
any acceleration techniques. We measure computational efficiency by the number
of proximity calculations required to classify a single time series. This comparison is
justified, because the complexities of computing a DTW distance and an elastic inner
product are identical. Then EL classifiers are $p$ times faster than NN classifiers, where
$p$ is the number of prototypes. Thus the fastest NN classifiers effectively have the same
computational effort as EL classifiers for arbitrary multi-class problems, but they are
not competitive with EL classifiers according to the second key observation.

Next, we discuss computational efficiency of both types of classifiers when acceleration
techniques are applied. For NN classifiers, two common techniques to decrease computation
time are global constraints such as the Sakoe-Chiba band [23] and reducing the
number of DTW distance calculations by applying lower-bounding techniques [21], [22].
Both techniques can equally well be applied to EL classifiers, where lower-bounding
techniques need to be converted to upper-bounding techniques. Furthermore, EL classifiers
can additionally control the computational effort by the number $m$ of columns
of the matrix space. Here $m$ was set to 10% of the length $n$ of the longest time
series of the training set.

The better performance of EL classifiers in comparison to
NN$_1$ classifiers is notable, because the decision boundaries that can be implemented
by their counterparts in the Euclidean space are both the set of all hyperplanes. We
assume that EL classifiers outperform NN$_1$ classifiers, because learning prototypes by
clustering minimizes a cluster criterion unrelated to the risk functional of a classification
problem. Therefore the resulting prototypes may fail to discriminate the data for
some problems.

The fourth key observation is that the strong assumption of elastic-linearly separable
problems is appropriate for some problems in time series classification. Error rates
of elastic linear classifiers for Coffee, ItalyPowerDemand, and Wafer are below 5%. For
these problems, the strong assumption made by EL classifiers is appropriate. For all
other datasets, the high error rates of EL classifiers could be caused by two factors:
first, the assumption that the data is elastic-linearly separable is inappropriate; and
second, the number of training examples given the length of the time series is too low
for learning (see ratio $\rho$ in Table 2). Here further experiments are required.

The fifth observation is that the different EL classifiers perform comparably, with
advantages for eLOGR and eLSVM. These findings correspond to similar findings for
logistic regression and linear SVM in vector spaces.

To complete the comparison, we contrast the time complexities of all classifiers required
for learning. NN+ALL requires no time for learning. The NN+AHC classifier
learns a prototype for each class using agglomerative hierarchical clustering. Determining
pairwise DTW distances is of complexity $O(n^2 N(N-1)/2)$, where $n$ is the
length of the time series and $N$ is the number of training examples. Given a pairwise
distance matrix, the complexity of agglomerative clustering is $O(N^3)$ in the general
case. Efficient variants of special agglomerative methods have a complexity of $O(N^2)$.
Thus, the complexity of NN+AHC is $O(n^2 N^2)$ in the best and $O(n^2 N^2 + N^3)$ in the
general case. The NN+KME classifier learns a prototype for each class using k-means under
elastic transformations. Its time complexity is $O(2 n^2 N t)$, where $t$ is the number of
iterations required until termination. The time complexity for learning an EL classifier
is $O(n m N t)$, where $m$ is the number of columns of the weight matrix. This
shows that the time complexity for learning an EL classifier is the same as learning
two prototypes by KME. However, in this setting, learning an EL classifier is about
a factor of 20 faster than KME, under the assumption that the number of iterations $t$ is
the same for both methods. If the number $N$ of training examples is large, NN+AHC
becomes prohibitively slow. In contrast, the learning procedures of NN+KME and EL
classifiers can be terminated after some pre-specified maximum number of iterations.
In doing so, we trade solution quality against feasible computation time.
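
The factor of about 20 can be checked directly from the two complexity expressions. The short derivation below assumes $m = n/10$, as in the experimental setup of this section, and an equal number of iterations $t$ for both methods:

```latex
\frac{2 n^2 N t}{n m N t} \;=\; \frac{2n}{m} \;=\; \frac{2n}{n/10} \;=\; 20 .
```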

To summarize, the results show that elastic linear classifiers are simple and efficient 
methods. They rely on the strong assumption that an elastic-linear decision boundary 
is appropriate. Therefore, elastic linear classifiers may yield inaccurate predictions 
when the assumptions are biased towards oversimplification and/or when the number 
of training examples is too low compared to the length of the time series. These 
findings are in line with those of linear classifiers in Euclidean space. 

6. Conclusion 

This paper introduces generalized gradient methods for learning on time series under
elastic transformations. This approach combines (a) the novel concept of elastic functions
that links elastic proximities on time series to piecewise smooth functions with
(b) generalized gradient methods for non-smooth optimization. Using the proposed
scheme, we (1) showed how a broad class of gradient-based learning can be applied to
time series under elastic transformations, (2) derived general convergence statements
that justify the generalizations, and (3) placed existing adaptive methods into proper
context. As examples, elastic logistic regression, elastic (margin) perceptron learning,
and elastic linear SVM have been tested on two-class problems and compared to nearest
neighbor classifiers using the DTW distance. Despite the simplicity in terms of the
decision boundary and the computational efficiency, elastic linear classifiers perform
convincingly. There is still room for improvement by controlling elasticity and by applying
different forms of regularization. The results indicate that adaptive methods based
on elastic functions may complement the state-of-the-art in statistical pattern recognition
on time series, in particular when powerful non-linear gradient-based methods
such as deep learning are extended to time series under elastic transformations.

A. Proof of Convergence Results for Elastic Linear
Classifiers

Since affine functions are convex and the maximum of convex functions is also convex, the
elastic inner product is convex. In addition, the composition of convex functions is convex.
Therefore the loss functions of elastic linear classifiers are convex. Then the first convergence
result is shown in [24].

To show the two other convergence statements, we assume that $|\mathcal{D}| = N$. For each training
example $(x_i, y_i) \in \mathcal{D}$ the loss

$$\ell_i(\theta) = \ell(y_i, b + a(x_i, W))$$

is real-valued and convex, where $\theta = (W, b)$. Then there is a positive scalar $C_i$ that bounds
the subdifferential of $\ell_i$ at $\theta$ for all $i \in [N]$. Suppose that

$$C = \max_{i=1,\ldots,N} C_i.$$

Then from m , Prop. 2.2. follows that the incremental generalized gradient method converges 
to a local minimum. 

To show the Elastic Margin Perceptron Convergence Theorem, we assume that

$$E_N(\theta) = \sum_{i=1}^{N} \ell_i(\theta)$$

is the error without averaging operation, that is $E_N = N \cdot R_N$. By assumption, the training
set $\mathcal{D}$ is elastic-linearly separable. Then the minimum value $E_*$ of $E_N$ is zero. From m,
Prop. 2.1, it follows that

$$\lim_{t \to \infty} E_N(\theta^t) \le E_* + \frac{\eta C^2}{2},$$

where $\eta$ is the learning rate. Choosing $\eta < \xi / C^2$ gives

$$\lim_{t \to \infty} E_N(\theta^t) < \frac{\xi}{2}.$$

Since $\xi > 0$, this implies that there is a $t_0$ such that $\ell_{i_t}(\theta^t) < \xi$ for all $t > t_0$. Here, $i_t$ refers
to example $(x_{i_t}, y_{i_t}) \in \mathcal{D}$ presented at iteration $t$. From this it follows that all training examples
are classified correctly after a finite number of update steps, provided that $\lambda < \xi$. ∎